Unsupervised Learning: Trade&Ahead¶

Marks: 60

Context¶

The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: doing so can help fight inflation, create wealth, and provide some tax benefits. Steady returns over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis for a multitude of stocks to identify the right picks can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics as well as ones that exhibit minimal correlation. This helps investors analyze stocks across different market segments and protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Estimated number of shares of the company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
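For concreteness, the two valuation ratios at the end of the dictionary can be computed directly from their definitions; a minimal sketch using hypothetical figures (none of these numbers come from the dataset):

```python
# Hypothetical per-share figures, for illustration only
current_price = 50.0          # current stock price in dollars
earnings_per_share = 2.5      # net profit / common shares outstanding
book_value_per_share = 20.0   # (total assets - total liabilities) / shares outstanding

pe_ratio = current_price / earnings_per_share    # dollars paid per dollar of earnings
pb_ratio = current_price / book_value_per_share  # price relative to book value

print(pe_ratio)  # 20.0
print(pb_ratio)  # 2.5
```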

Importing necessary libraries and data¶

In [2]:
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set the Seaborn default theme
sns.set_theme()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to perform PCA
from sklearn.decomposition import PCA

import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")
In [3]:
# Connecting google drive to google colab
from google.colab import drive

drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
# Loading stock data
stockdata = pd.read_csv('/content/drive/My Drive/Colab_Notebooks/unsupervised_learning/project_7/stock_data.csv')

stockdata.head(10)
Out[4]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810
5 ADM Archer-Daniels-Midland Co Consumer Staples Agricultural Products 36.680000 -12.017268 1.516493 10 49 -189000000 1849000000 2.99 6.183946e+08 12.267559 7.496831
6 ADS Alliance Data Systems Information Technology Data Processing & Outsourced Services 276.570007 6.189286 1.116976 30 25 90885000 596541000 8.91 6.695185e+07 31.040405 129.064585
7 AEE Ameren Corp Utilities MultiUtilities 43.230000 2.174424 1.124186 9 14 287000000 636000000 2.60 2.446154e+08 16.626923 -0.719497
8 AEP American Electric Power Utilities Electric Utilities 58.270000 2.371753 1.068485 11 9 13900000 2052300000 3.13 4.218978e+08 18.456543 -3.022649
9 AFL AFLAC Inc Financials Life & Health Insurance 59.900002 3.027181 1.048295 14 99 -308000000 2533000000 5.88 4.307823e+08 10.187075 -1.883912

Data Overview¶

  • Observations
  • Sanity checks
In [5]:
stockdata.shape
Out[5]:
(340, 15)

There are 340 rows and 15 columns in the stock dataframe provided by Trade&Ahead.

In [6]:
# viewing a random sample of the dataset
stockdata.sample(n=10, random_state=1)
Out[6]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.55 4.065823e+08 93.089287 1.785616
125 FB Facebook Information Technology Internet Software & Services 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.31 2.800763e+09 79.893133 5.884467
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.52 1.636250e+08 26.335526 -1.269332
248 PG Procter & Gamble Consumer Staples Personal Products 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.28 4.913916e+08 24.070121 -2.256747
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.23 7.652981e+08 93.089287 3.345102
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 4.353535e+08 17.682214 -3.838260
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.56 1.520911e+08 93.089287 9.567952
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.79 8.493671e+08 93.089287 17.345857
89 DFS Discover Financial Services Financials Consumer Finance 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.14 4.468872e+08 10.431906 -0.375934
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.26 4.283628e+08 14.814159 4.218620

The table above shows a random sample of the data. The values look reasonable, but the column names contain spaces and require a cleanup.

In [7]:
# copying the data to another variable to avoid any changes to original data
df = stockdata.copy()
In [8]:
# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]
In [9]:
# dropping the ticker symbol column as it is not relevant at this point.
df.drop("Ticker_Symbol", axis=1, inplace=True)

Checking the data types of the columns for the dataset

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Security                      340 non-null    object 
 1   GICS_Sector                   340 non-null    object 
 2   GICS_Sub_Industry             340 non-null    object 
 3   Current_Price                 340 non-null    float64
 4   Price_Change                  340 non-null    float64
 5   Volatility                    340 non-null    float64
 6   ROE                           340 non-null    int64  
 7   Cash_Ratio                    340 non-null    int64  
 8   Net_Cash_Flow                 340 non-null    int64  
 9   Net_Income                    340 non-null    int64  
 10  Earnings_Per_Share            340 non-null    float64
 11  Estimated_Shares_Outstanding  340 non-null    float64
 12  P/E_Ratio                     340 non-null    float64
 13  P/B_Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(3)
memory usage: 37.3+ KB

After dropping the Ticker Symbol column, we now have 14 columns. The row count remains 340, and the data types include objects, floats, and integers.

Let's take a look at the summary of the data

In [11]:
df.describe()
Out[11]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio
count 340.000000 340.000000 340.000000 340.000000 340.000000 3.400000e+02 3.400000e+02 340.000000 3.400000e+02 340.000000 340.000000
mean 80.862345 4.078194 1.525976 39.597059 70.023529 5.553762e+07 1.494385e+09 2.776662 5.770283e+08 32.612563 -1.718249
std 98.055086 12.006338 0.591798 96.547538 90.421331 1.946365e+09 3.940150e+09 6.587779 8.458496e+08 44.348731 13.966912
min 4.500000 -47.129693 0.733163 1.000000 0.000000 -1.120800e+10 -2.352800e+10 -61.200000 2.767216e+07 2.935451 -76.119077
25% 38.555000 -0.939484 1.134878 9.750000 18.000000 -1.939065e+08 3.523012e+08 1.557500 1.588482e+08 15.044653 -4.352056
50% 59.705000 4.819505 1.385593 15.000000 47.000000 2.098000e+06 7.073360e+08 2.895000 3.096751e+08 20.819876 -1.067170
75% 92.880001 10.695493 1.695549 27.000000 99.000000 1.698108e+08 1.899000e+09 4.620000 5.731175e+08 31.764755 3.917066
max 1274.949951 55.051683 4.580042 917.000000 958.000000 2.076400e+10 2.444200e+10 50.090000 6.159292e+09 528.039074 129.064585
  • The average current stock price is 80.86 dollars.
  • The average change in stock price is 4.1 percent.
  • The average volatility (13-week standard deviation of the price) is 1.53.
  • The average ROE is 39.6 percent.
  • The average cash ratio is 70.
  • The average net cash flow is 55,537,620 dollars.
  • The average net income is 1,494,385,000 dollars.
  • The average earnings per share is 2.78 dollars.
  • The average estimated shares outstanding is 577,028,300.

The minimum cash ratio is zero; since zeros can mask missing data, let's check for missing values.

In [12]:
# checking for missing values
df.isna().sum()
Out[12]:
Security                        0
GICS_Sector                     0
GICS_Sub_Industry               0
Current_Price                   0
Price_Change                    0
Volatility                      0
ROE                             0
Cash_Ratio                      0
Net_Cash_Flow                   0
Net_Income                      0
Earnings_Per_Share              0
Estimated_Shares_Outstanding    0
P/E_Ratio                       0
P/B_Ratio                       0
dtype: int64

There are no missing values. Let's check for duplicate entries.

In [13]:
# let's check for duplicate observations
df.duplicated().sum()
Out[13]:
0

There are no duplicate entries.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
In [14]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; "auto" lets seaborn pick the bin count when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [15]:
num_cols = ['Current_Price', 'Price_Change', 'Volatility', 'ROE', 'Cash_Ratio', 'Net_Cash_Flow', 'Net_Income', 'Earnings_Per_Share', 'Estimated_Shares_Outstanding', 'P/E_Ratio', 'P/B_Ratio']
In [16]:
for num in num_cols:

  histogram_boxplot(df, num, bins=50, kde=True, figsize=(15, 8));

plt.show();
  • The current stock price is significantly right-skewed, with a long tail indicating the presence of outliers.

  • Prices typically range between zero and 150 dollars but go above 1,200 dollars.

  • The median price is about 50 dollars and the mean is about 80 dollars.

  • Price change, net cash flow, earnings per share, and P/B ratio are approximately normally distributed.

  • Volatility is right-skewed with a long tail.

  • ROE, cash ratio, estimated shares outstanding, and P/E ratio are significantly right-skewed with long tails.
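The skewness read off the histograms can also be quantified numerically with pandas' skew(); a self-contained sketch on synthetic data (in the notebook the equivalent call would be df[num_cols].skew()):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),  # skew well above 0
    "symmetric": rng.normal(size=1000),                     # skew near 0
})

# Positive values indicate a right (long upper) tail, as seen for ROE,
# cash ratio, estimated shares outstanding, and P/E ratio above
print(demo.skew())
```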

In [17]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [18]:
labeled_barplot(df, 'GICS_Sector', perc=True)

Industrial stocks comprise the largest share of the GICS sectors, followed by financial stocks.

In [19]:
labeled_barplot(df, 'GICS_Sub_Industry', perc=True)

The dominating sub-industries include Oil & Gas Exploration & Production, REITs, Industrial Conglomerates, Electric Utilities, Internet Software & Services, Health Care Equipment, MultiUtilities, and Banks.

In [20]:
# Grouping by sector to compute the average price change

avg_max_price_inc = df.groupby('GICS_Sector')['Price_Change'].mean().reset_index()
avg_max_price_inc
Out[20]:
GICS_Sector Price_Change
0 Consumer Discretionary 5.846093
1 Consumer Staples 8.684750
2 Energy -10.228289
3 Financials 3.865406
4 Health Care 9.585652
5 Industrials 2.833127
6 Information Technology 7.217476
7 Materials 5.589738
8 Real Estate 6.205548
9 Telecommunications Services 6.956980
10 Utilities 0.803657

The Health Care sector saw the maximum average price increase (about 9.6 percent).
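To make the ranking explicit rather than eyeballing the table, the grouped means can be sorted; a self-contained sketch of the same pattern on toy data (the sector names and values below are illustrative, not from the project file):

```python
import pandas as pd

toy = pd.DataFrame({
    "GICS_Sector": ["Energy", "Health Care", "Health Care", "Utilities"],
    "Price_Change": [-10.0, 9.0, 10.0, 0.8],
})

# Same groupby-mean as the notebook cell, plus a descending sort
ranked = (
    toy.groupby("GICS_Sector")["Price_Change"]
    .mean()
    .sort_values(ascending=False)
)
print(ranked.index[0])  # Health Care
```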

In [21]:
pd.crosstab(df.GICS_Sub_Industry, df.GICS_Sector).style.highlight_max(
    color="lightgreen", axis=0
)
Out[21]:
GICS_Sector Consumer Discretionary Consumer Staples Energy Financials Health Care Industrials Information Technology Materials Real Estate Telecommunications Services Utilities
GICS_Sub_Industry                      
Advertising 2 0 0 0 0 0 0 0 0 0 0
Aerospace & Defense 0 0 0 0 0 4 0 0 0 0 0
Agricultural Products 0 1 0 0 0 0 0 0 0 0 0
Air Freight & Logistics 0 0 0 0 0 3 0 0 0 0 0
Airlines 0 0 0 0 0 5 0 0 0 0 0
Alternative Carriers 0 0 0 0 0 0 0 0 0 1 0
Apparel, Accessories & Luxury Goods 1 0 0 0 0 0 0 0 0 0 0
Application Software 0 0 0 0 0 0 2 0 0 0 0
Asset Management & Custody Banks 0 0 0 4 0 0 0 0 0 0 0
Auto Parts & Equipment 2 0 0 0 0 0 0 0 0 0 0
Automobile Manufacturers 2 0 0 0 0 0 0 0 0 0 0
Banks 0 0 0 10 0 0 0 0 0 0 0
Biotechnology 0 0 0 0 7 0 0 0 0 0 0
Brewers 0 1 0 0 0 0 0 0 0 0 0
Broadcasting & Cable TV 2 0 0 0 0 0 0 0 0 0 0
Building Products 0 0 0 0 0 4 0 0 0 0 0
Cable & Satellite 3 0 0 0 0 0 0 0 0 0 0
Casinos & Gaming 1 0 0 0 0 0 0 0 0 0 0
Computer Hardware 0 0 0 0 0 0 1 0 0 0 0
Construction & Farm Machinery & Heavy Trucks 0 0 0 0 0 3 0 0 0 0 0
Construction Materials 0 0 0 0 0 0 0 2 0 0 0
Consumer Electronics 1 0 0 0 0 0 0 0 0 0 0
Consumer Finance 0 0 0 5 0 0 0 0 0 0 0
Copper 0 0 0 0 0 0 0 1 0 0 0
Data Processing & Outsourced Services 0 0 0 0 0 0 2 0 0 0 0
Distributors 1 0 0 0 0 0 0 0 0 0 0
Diversified Chemicals 0 0 0 0 0 0 0 5 0 0 0
Diversified Commercial Services 0 0 0 0 0 1 0 0 0 0 0
Diversified Financial Services 0 0 0 7 0 0 0 0 0 0 0
Drug Retail 0 1 0 0 0 0 0 0 0 0 0
Electric Utilities 0 0 0 0 0 0 0 0 0 0 12
Electrical Components & Equipment 0 0 0 0 0 1 0 0 0 0 0
Electronic Components 0 0 0 0 0 0 2 0 0 0 0
Electronic Equipment & Instruments 0 0 0 0 0 0 1 0 0 0 0
Environmental Services 0 0 0 0 0 1 0 0 0 0 0
Fertilizers & Agricultural Chemicals 0 0 0 0 0 0 0 2 0 0 0
Financial Exchanges & Data 0 0 0 1 0 0 0 0 0 0 0
Gold 0 0 0 0 0 0 0 1 0 0 0
Health Care Distributors 0 0 0 0 3 0 0 0 0 0 0
Health Care Equipment 0 0 0 0 11 0 0 0 0 0 0
Health Care Facilities 0 0 0 0 5 0 0 0 0 0 0
Health Care Supplies 0 0 0 0 2 0 0 0 0 0 0
Home Entertainment Software 0 0 0 0 0 0 1 0 0 0 0
Home Furnishings 1 0 0 0 0 0 0 0 0 0 0
Homebuilding 2 0 0 0 0 0 0 0 0 0 0
Hotels, Resorts & Cruise Lines 4 0 0 0 0 0 0 0 0 0 0
Household Appliances 1 0 0 0 0 0 0 0 0 0 0
Household Products 0 3 0 0 0 0 0 0 0 0 0
Housewares & Specialties 1 0 0 0 0 0 0 0 0 0 0
Human Resource & Employment Services 0 0 0 0 0 1 0 0 0 0 0
IT Consulting & Other Services 0 0 0 0 0 0 3 0 0 0 0
Industrial Conglomerates 0 0 0 0 0 14 0 0 0 0 0
Industrial Gases 0 0 0 0 0 0 0 1 0 0 0
Industrial Machinery 0 0 0 0 0 5 0 0 0 0 0
Industrial Materials 0 0 0 0 0 1 0 0 0 0 0
Insurance Brokers 0 0 0 3 0 0 0 0 0 0 0
Integrated Oil & Gas 0 0 5 0 0 0 0 0 0 0 0
Integrated Telecommunications Services 0 0 0 0 0 0 0 0 0 4 0
Internet & Direct Marketing Retail 4 0 0 0 0 0 0 0 0 0 0
Internet Software & Services 0 0 0 0 0 0 12 0 0 0 0
Investment Banking & Brokerage 0 0 0 2 0 0 0 0 0 0 0
Leisure Products 2 0 0 0 0 0 0 0 0 0 0
Life & Health Insurance 0 0 0 3 0 0 0 0 0 0 0
Life Sciences Tools & Services 0 0 0 0 1 0 0 0 0 0 0
Managed Health Care 0 0 0 0 5 0 0 0 0 0 0
Metal & Glass Containers 0 0 0 0 0 0 0 1 0 0 0
Motorcycle Manufacturers 1 0 0 0 0 0 0 0 0 0 0
Multi-Sector Holdings 0 0 0 1 0 0 0 0 0 0 0
Multi-line Insurance 0 0 0 1 0 0 0 0 0 0 0
MultiUtilities 0 0 0 0 0 0 0 0 0 0 11
Networking Equipment 0 0 0 0 0 0 1 0 0 0 0
Office REITs 0 0 0 0 0 0 0 0 1 0 0
Oil & Gas Equipment & Services 0 0 3 0 0 0 0 0 0 0 0
Oil & Gas Exploration & Production 0 0 16 0 0 0 0 0 0 0 0
Oil & Gas Refining & Marketing & Transportation 0 0 6 0 0 0 0 0 0 0 0
Packaged Foods & Meats 0 6 0 0 0 0 0 0 0 0 0
Paper Packaging 0 0 0 0 0 0 0 2 0 0 0
Personal Products 0 1 0 0 0 0 0 0 0 0 0
Pharmaceuticals 0 0 0 0 6 0 0 0 0 0 0
Property & Casualty Insurance 0 0 0 8 0 0 0 0 0 0 0
Publishing 1 0 0 0 0 0 0 0 0 0 0
REITs 0 0 0 0 0 0 0 0 14 0 0
Railroads 0 0 0 0 0 4 0 0 0 0 0
Real Estate Services 0 0 0 0 0 0 0 0 1 0 0
Regional Banks 0 0 0 3 0 0 0 0 0 0 0
Research & Consulting Services 0 0 0 0 0 4 0 0 0 0 0
Residential REITs 0 0 0 0 0 0 0 0 4 0 0
Restaurants 3 0 0 0 0 0 0 0 0 0 0
Retail REITs 0 0 0 0 0 0 0 0 4 0 0
Semiconductor Equipment 0 0 0 0 0 0 1 0 0 0 0
Semiconductors 0 0 0 0 0 0 6 0 0 0 0
Soft Drinks 0 4 0 0 0 0 0 0 0 0 0
Specialized REITs 0 0 0 0 0 0 0 0 3 0 0
Specialty Chemicals 0 0 0 0 0 0 0 4 0 0 0
Specialty Retail 1 0 0 0 0 0 0 0 0 0 0
Specialty Stores 3 0 0 0 0 0 0 0 0 0 0
Steel 0 0 0 0 0 0 0 1 0 0 0
Technology Hardware, Storage & Peripherals 0 0 0 0 0 0 1 0 0 0 0
Technology, Hardware, Software and Supplies 0 0 0 0 0 1 0 0 0 0 0
Thrifts & Mortgage Finance 0 0 0 1 0 0 0 0 0 0 0
Tires & Rubber 1 0 0 0 0 0 0 0 0 0 0
Tobacco 0 2 0 0 0 0 0 0 0 0 0
Trucking 0 0 0 0 0 1 0 0 0 0 0
Water Utilities 0 0 0 0 0 0 0 0 0 0 1

Most securities fall under the Oil & Gas Exploration & Production sub-industry within the Energy sector.

In [22]:
# Bivariate analysis
# Stock Price_Change Vs. GICS_Sector

plt.figure(figsize=(20,6))
sns.boxplot(data = df, y = "Price_Change", x = "GICS_Sector");
  • Materials sector securities had the highest median price change.
  • Information Technology has an unusually high price change (outlier), the highest outlier, above the Materials sector.
  • The Energy sector experienced the lowest, and negative, price change.
  • Utilities, Financials, Real Estate, and Telecommunications had the least variation in price changes.
In [23]:
# Cash_Ratio Vs. GICS_Sector

plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "Cash_Ratio", x = "GICS_Sector");

Utilities securities had the lowest cash ratios.

In [24]:
# P/E_Ratio Vs. GICS_Sector

plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "P/E_Ratio", x = "GICS_Sector");
In [25]:
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Net income has a positive relationship with earnings per share and estimated shares outstanding.

  • Earnings per share has a negative relationship with ROE and Volatility.
  • Price change also has a negative relationship with volatility
  • Net income has a negative relationship with volatility
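The strongest pairwise relationships can be extracted from the correlation matrix programmatically instead of reading them off the heatmap; a sketch on synthetic columns (the names a, b, c are placeholders, not dataset columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + 0.1 * rng.normal(size=200),  # strongly correlated with a
    "c": rng.normal(size=200),                # independent noise
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head(1))  # ('a', 'b') tops the ranking
```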
In [26]:
fig, axes = plt.subplots(5, 2, figsize=(20, 15))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(5):
    sns.ecdfplot(ax=axes[ii][0], x=df[num_cols[counter]])
    counter = counter + 1
    if counter != 11:
        sns.ecdfplot(ax=axes[ii][1], x=df[num_cols[counter]])
        counter = counter + 1
    else:
        pass

fig.tight_layout(pad=2.0)
In [27]:
# Pair-plot analysis

sns.pairplot(df[num_cols],diag_kind="kde");

Several variables are approximately normally distributed, though a few (e.g., ROE, cash ratio, P/E ratio) remain heavily right-skewed.

Data Preprocessing¶

In [28]:
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_cols].copy()
subset_scaled = scaler.fit_transform(subset)
In [29]:
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
In [30]:
subset_scaled_df
Out[30]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196
... ... ... ... ... ... ... ... ... ... ... ...
335 -0.486181 0.901646 0.540121 -0.255142 4.308162 -0.559673 -1.487784 -1.127481 0.429111 -0.082116 0.572194
336 -0.289510 -1.065766 -0.079703 1.062211 -0.476513 0.053235 -0.051186 0.029391 -0.167741 -0.337154 -0.152012
337 0.221913 0.439539 -0.206067 -0.400362 0.332009 0.164889 -0.342467 -0.303532 -0.460058 2.233634 -1.589390
338 -0.547053 -0.436811 -0.097813 -0.369243 0.320933 -0.051022 -0.301171 -0.239684 -0.377852 -0.222714 0.118680
339 -0.336453 1.051046 0.142671 -0.078803 -0.055639 0.111378 -0.293666 -0.318734 -0.092942 0.854902 0.246754

340 rows × 11 columns

In [31]:
# Pair-plot analysis

sns.pairplot(subset_scaled_df ,diag_kind="kde");

All attributes are now standardized to a common scale, with a mean of 0 and a standard deviation of 1. The dataset has no missing values or duplicate entries. Outliers were detected but have not been altered.

  • It is important to note that in the context of modeling, these outliers are considered genuine data points rather than anomalies.
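The standardization claim can be verified numerically: after z-scoring, each column's mean is (up to floating-point error) 0 and its population standard deviation is 1. A self-contained sketch with StandardScaler on toy data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 40.0, 80.0]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Means are ~0; StandardScaler divides by the population std (ddof=0), so
# checking with ddof=0 gives 1 up to float error
print(scaled.mean().abs().max() < 1e-9)             # True
print((scaled.std(ddof=0) - 1).abs().max() < 1e-9)  # True
```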

K-means Clustering¶

In [32]:
clusters = range(1, 9)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.3862098789299604
Number of Clusters: 3 	Average Distortion: 2.2652173220958
Number of Clusters: 4 	Average Distortion: 2.176396791566185
Number of Clusters: 5 	Average Distortion: 2.112997232924101
Number of Clusters: 6 	Average Distortion: 2.0574742484876594
Number of Clusters: 7 	Average Distortion: 2.0316763803182827
Number of Clusters: 8 	Average Distortion: 1.974528327675814

An appropriate value for k seems to be 3, 4, or 5.
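As an aside, the average distortion computed above with cdist plays the same role as the objective scikit-learn stores in inertia_ (the sum of squared distances to the nearest centroid), so an equivalent elbow can be drawn without recomputing distances; a sketch on synthetic blobs (the make_blobs parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# inertia_ / n_samples is an elbow metric analogous to the average distortion
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    for k in range(1, 9)
}

# Inertia drops sharply until k reaches the true number of blobs
print(inertias[1] > inertias[4])  # True
```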

In [33]:
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict((subset_scaled_df))
    # centers = clusterer.cluster_centers_
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.4495910932182279
For n_clusters = 5, silhouette score is 0.4201772228010058
For n_clusters = 6, silhouette score is 0.39935785683828356
For n_clusters = 7, silhouette score is 0.400894762440915
For n_clusters = 8, silhouette score is 0.4213918040275059
For n_clusters = 9, silhouette score is 0.31844099775524826
Out[33]:
[<matplotlib.lines.Line2D at 0x7ddde0d5fe80>]

The silhouette score is highest for 3 clusters, with 4 a close second.

In [34]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[34]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [35]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[35]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [36]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[36]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [37]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[37]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [38]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[38]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Selecting final model¶

Let's take 4 as the appropriate number of clusters, as its silhouette score is high enough and there is a kink near 4 in the elbow curve.

In [39]:
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
Out[39]:
KMeans(n_clusters=4, random_state=0)
In [40]:
# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
In [41]:
subset_scaled_df['K_means_segments'] = kmeans.labels_

Cluster Profiling¶

In [42]:
# mean of each numeric feature per cluster (numeric_only avoids errors from string columns)
cluster_profile = df.groupby('K_means_segments').mean(numeric_only=True)
In [43]:
cluster_profile['count_in_each_segments'] = df.groupby('K_means_segments')['Security'].count().values
In [44]:
cluster_profile
Out[44]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio count_in_each_segments
K_means_segments
0 234.170932 13.400685 1.729989 25.600000 277.640000 1.554927e+09 1.572612e+09 6.045200 5.783163e+08 74.960824 14.402452 25
1 38.099260 -15.370329 2.910500 107.074074 50.037037 -1.594285e+08 -3.887458e+09 -9.473704 4.803986e+08 90.619220 1.342067 27
2 50.517273 5.747586 1.130399 31.090909 75.909091 -1.072273e+09 1.483309e+10 4.154545 4.298827e+09 14.803577 -4.552119 11
3 72.399112 5.066225 1.388319 34.620939 53.000000 -1.404622e+07 1.482212e+09 3.621029 4.385338e+08 23.843656 -3.358948 277
In [45]:
fig, axes = plt.subplots(3, 4,  figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter <11:
            sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[num_cols[counter]],x=subset_scaled_df['K_means_segments'])
        counter = counter+1

fig.tight_layout(pad=2.0)
In [46]:
# Comparing cluster vs. GICS_Sector

pd.crosstab(df.GICS_Sector, df.K_means_segments).style.highlight_max(color = 'lightgreen', axis = 0)
Out[46]:
K_means_segments 0 1 2 3
GICS_Sector        
Consumer Discretionary 6 0 1 33
Consumer Staples 1 0 1 17
Energy 1 22 1 6
Financials 1 0 3 45
Health Care 9 0 2 29
Industrials 0 1 0 52
Information Technology 5 3 1 24
Materials 0 1 0 19
Real Estate 1 0 0 26
Telecommunications Services 1 0 2 2
Utilities 0 0 0 24

Hierarchical Clustering¶

In [65]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9361573137155428
Cophenetic correlation for Euclidean distance and complete linkage is 0.8912171674298116
Cophenetic correlation for Euclidean distance and average linkage is 0.9402170033461534
Cophenetic correlation for Euclidean distance and weighted linkage is 0.891382329460296
Cophenetic correlation for Chebyshev distance and single linkage is 0.9187458528921174
Cophenetic correlation for Chebyshev distance and complete linkage is 0.8228080389704356
Cophenetic correlation for Chebyshev distance and average linkage is 0.9374227213554178
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9145555208649849
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9428343342402944
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7430655304868364
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9442308978857612
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.9064227595906567
Cophenetic correlation for Cityblock distance and single linkage is 0.9510044191059339
Cophenetic correlation for Cityblock distance and complete linkage is 0.8283141710925163
Cophenetic correlation for Cityblock distance and average linkage is 0.9203303446749941
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8205165624117894
In [66]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9510044191059339, which is obtained with Cityblock distance and single linkage
In [67]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linkage is {}".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9361573137155428
Cophenetic correlation for complete linkage is 0.8912171674298116
Cophenetic correlation for average linkage is 0.9402170033461534
Cophenetic correlation for centroid linkage is 0.9350232346888858
Cophenetic correlation for ward linkage is 0.7636969819420432
Cophenetic correlation for weighted linkage is 0.891382329460296
In [68]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9402170033461534, which is obtained with average linkage
In [69]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(20, 40))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
  • Cophenetic correlation is highest with average and centroid linkages

  • 6 appears to be the appropriate number of clusters from the dendrogram for average linkage
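The cut suggested by the dendrogram can also be made directly on the linkage matrix with `scipy.cluster.hierarchy.fcluster`. A hedged sketch, again with synthetic data in place of `subset_scaled_df`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=340, centers=6, random_state=1)

# same combination as chosen above: Euclidean distance, average linkage
Z = linkage(X, metric="euclidean", method="average")

# cut the tree so that at most 6 flat clusters remain
labels = fcluster(Z, t=6, criterion="maxclust")
print(np.unique(labels))
```

This yields flat cluster labels equivalent to fitting `AgglomerativeClustering` with the same metric and linkage, which the next cell does on the real data.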

In [70]:
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
Out[70]:
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=6)
In [71]:
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
In [72]:
# mean of each numeric feature per cluster (numeric_only avoids errors from string columns)
cluster_profile = df.groupby("HC_Clusters").mean(numeric_only=True)
In [73]:
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
In [74]:
cluster_profile
Out[74]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio K_means_segments count_in_each_segments
HC_Clusters
0 77.287589 4.099730 1.518066 35.336336 66.900901 -3.319732e+07 1.538075e+09 2.88527 5.605050e+08 32.441706 -2.174921 2.630631 333
1 25.640000 11.237908 1.322355 12.500000 130.500000 1.675550e+10 1.365400e+10 3.29500 2.791829e+09 13.649696 1.508484 1.000000 2
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1.292500e+09 -1.910650e+10 -41.81500 5.195740e+08 60.748608 1.565141 1.000000 2
3 104.660004 16.224320 1.320606 8.000000 958.000000 5.920000e+08 3.669000e+09 1.31000 2.800763e+09 79.893133 5.884467 0.000000 1
4 1274.949951 3.190527 1.268340 29.000000 184.000000 -1.671386e+09 2.551360e+09 50.09000 5.093552e+07 25.453183 -1.052429 0.000000 1
5 276.570007 6.189286 1.116976 30.000000 25.000000 9.088500e+07 5.965410e+08 8.91000 6.695185e+07 31.040405 129.064585 0.000000 1
In [75]:
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
Out[75]:
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
In [76]:
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
In [77]:
# mean of each numeric feature per cluster (numeric_only avoids errors from string columns)
cluster_profile = df.groupby("HC_Clusters").mean(numeric_only=True)
In [78]:
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
In [79]:
cluster_profile
Out[79]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio K_means_segments count_in_each_segments
HC_Clusters
0 46.558126 -11.798670 2.617878 178.750000 50.250000 4.349716e+07 -3.197472e+09 -7.785312 4.732895e+08 72.496532 -0.780467 1.281250 32
1 71.846974 4.953643 1.392784 25.117216 53.831502 1.197788e+06 1.557674e+09 3.691044 4.439183e+08 23.583804 -3.087957 2.996337 273
2 229.579357 14.049986 1.735216 25.423077 268.423077 1.712688e+09 1.981882e+09 5.946923 7.219242e+08 84.216911 13.114240 0.115385 26
3 46.672222 5.166566 1.079367 25.000000 58.333333 -3.040667e+09 1.484844e+10 3.435556 4.564960e+09 15.596051 -6.354193 2.000000 9
In [80]:
# let's see the names of the securities in each cluster
for cl in df["HC_Clusters"].unique():
    print(
        "The",
        df[df["HC_Clusters"] == cl]["Security"].nunique(),
        "Securities in cluster",
        cl,
        "are:",
    )
    print(df[df["HC_Clusters"] == cl]["Security"].unique())
    print("-" * 100, "\n")
The 273 Securities in cluster 1 are:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb'
 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.'
 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.'
 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group'
 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.'
 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.'
 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy'
 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.'
 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems'
 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines'
 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services'
 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company'
 'Discovery Communications-A' 'Discovery Communications-C'
 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet'
 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.'
 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.'
 "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum'
 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient'
 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)'
 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.'
 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal'
 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International'
 'Roper Industries' 'Republic Services Inc' 'SCANA Corp'
 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air'
 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.'
 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.'
 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc'
 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']
---------------------------------------------------------------------------------------------------- 

The 26 Securities in cluster 2 are:
['Analog Devices, Inc.' 'Alliance Data Systems' 'Alexion Pharmaceuticals'
 'Amgen Inc' 'Amazon.com Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.'
 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences'
 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Intel Corp.'
 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage'
 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor'
 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd'
 'Yahoo Inc.']
---------------------------------------------------------------------------------------------------- 

The 32 Securities in cluster 0 are:
['Allegion' 'Apache Corporation' 'Anadarko Petroleum Corp'
 'Baker Hughes Inc' 'Chesapeake Energy' 'Charter Communications'
 'Colgate-Palmolive' 'Cabot Oil & Gas' 'Concho Resources'
 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation'
 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation'
 'Hewlett Packard Enterprise' 'Kimberly-Clark' 'Kinder Morgan'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.'
 'S&P Global, Inc.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.'
 'Cimarex Energy']
---------------------------------------------------------------------------------------------------- 

The 9 Securities in cluster 3 are:
['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company'
 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo'
 'Exxon Mobil Corp.']
---------------------------------------------------------------------------------------------------- 

In [81]:
fig, axes = plt.subplots(3, 4,  figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter <11:
            sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[num_cols[counter]],x=subset_scaled_df['HC_Clusters'])
        counter = counter+1

fig.tight_layout(pad=2.0)
In [82]:
# Comparing cluster vs. GICS_Sector

pd.crosstab(df.GICS_Sector, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
Out[82]:
HC_Clusters 0 1 2 3
GICS_Sector        
Consumer Discretionary 1 32 6 1
Consumer Staples 2 15 1 1
Energy 23 6 0 1
Financials 1 44 1 3
Health Care 0 30 9 1
Industrials 2 51 0 0
Information Technology 2 24 7 0
Materials 1 19 0 0
Real Estate 0 26 1 0
Telecommunications Services 0 2 1 2
Utilities 0 24 0 0

INSIGHT

K-means vs Hierarchical Clustering¶

The two clustering techniques can be compared on several points:

  • Which technique took less time to execute?
  • Which technique produced more distinct clusters, or are the clusters essentially the same?
  • How many observations fall into corresponding clusters under both algorithms?
  • How many clusters does each algorithm suggest as appropriate?

Any differences or similarities between the cluster profiles obtained from the two techniques are also worth noting.

In [83]:
# Comparing Hierarchical cluster vs. K-means  cluster

pd.crosstab(df.K_means_segments, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
Out[83]:
HC_Clusters 0 1 2 3
K_means_segments        
0 1 0 24 0
1 26 0 1 0
2 0 1 1 9
3 5 272 0 0
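The agreement visible in the crosstab can be summarized with a single number using the adjusted Rand index, which scores how similar two partitions are regardless of which integer ids each algorithm assigned. A small sketch with hypothetical label arrays standing in for the `K_means_segments` and `HC_Clusters` columns:

```python
from sklearn.metrics import adjusted_rand_score

# hypothetical label assignments from two clustering runs on the same 8 points
kmeans_labels = [0, 0, 1, 1, 2, 2, 3, 3]
hc_labels = [2, 2, 0, 0, 3, 3, 1, 1]  # same partition, different label ids

# ARI is invariant to label permutation: 1.0 means identical partitions,
# values near 0 mean agreement no better than chance
print(adjusted_rand_score(kmeans_labels, hc_labels))  # 1.0
```

On the real data, `adjusted_rand_score(df["K_means_segments"], df["HC_Clusters"])` would quantify the near-complete overlap the crosstab shows.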
In [84]:
# PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are

# setting the number of components to 2
pca = PCA(n_components=2)

# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
    data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
In [85]:
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
Out[85]:
0.38933325456234846
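Two components explain only about 39% of the variance here, so the 2-D scatterplots below are an approximate view. It can be useful to check how many components a given variance threshold would require; a minimal sketch on synthetic data standing in for the 11 scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in: 340 rows, 11 features, like the scaled stock data
rng = np.random.default_rng(1)
X = rng.normal(size=(340, 11))

# fit PCA with all components and accumulate explained variance
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components whose cumulative explained variance reaches 80%
n_80 = int(np.searchsorted(cum_var, 0.80)) + 1
print(n_80)
```

Running the same check on `subset_scaled_df` would show how much information the 2-D projection discards and whether a higher-dimensional view is worth inspecting.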
In [86]:
#Scatterplot by cluster labels - K-Means Clustering
sns.scatterplot(
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df["K_means_segments"],
    palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Out[86]:
<matplotlib.legend.Legend at 0x7ddde21889d0>
In [87]:
# Scatterplot by cluster labels - HierarchicalClustering
sns.scatterplot(
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df["HC_Clusters"],
    palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Out[87]:
<matplotlib.legend.Legend at 0x7ddde137b7c0>

Actionable Insights and Recommendations¶

Exploratory Data Analysis Findings:

  • Stock prices and Estimated Shares Outstanding exhibit right-skewed distributions with positive outliers.
  • The Health Care and Financials sectors had notable positive price changes, making them attractive to investors.
  • The Information Technology and Financials sectors stand out with high cash ratios, enhancing their appeal.
  • The Real Estate sector offers stable investment due to consistent price change and cash ratio.
  • The Energy sector is volatile, with high variance in price change, yet holds securities with significant P/E ratios.

Clustering Summary:

Out of 340 securities, both methods clustered 331 similarly; 9 were assigned differently. Clusters aligned with industry sectors, and the PCA visualization showed consistent results.

Cluster Insights:

  • A high-performing cluster (25+ securities) led by Health Care, Consumer Discretionary, and Information Technology.
  • A historically underperforming cluster (25+ securities) drawn mostly from Energy.
  • A moderately aggressive cluster (~10 securities) led by Financials.
  • A large, mildly aggressive cluster (270+ securities) spread across Industrials, Financials, Consumer Discretionary, Real Estate, and Information Technology.

Recommendations:

The four clusters offer diverse investment approaches based on aggressiveness and performance. Market volatility underscores the need for ongoing analysis and periodic re-clustering to keep the groupings current.